The Natality.csv contains information of Natality in each state. This includes the baby’s gender, birth weight, mother’s age and total number of births. The Census2015.cvs file contains the median age of the total population in each state.
Below is our sessionInfo().
sessionInfo(package=NULL)
source("../01 Data/ETL.R")
require(readr)
require(plyr)
file_path = "../01 Data/Natality.csv"
natality <- readr::read_csv(file_path)
names(natality)
df <- natality
names(df)
str(df) # Uncomment this line and run just the lines to here to get column types to use for getting the list of measures.
measures <- c("State_Code","Births","Average_Birth_Weight","Average_Age_Mother")
dimensions <- setdiff(names(df), measures)
dimensions
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
str(df)
na2emptyString <- function (x) {
x[is.na(x)] <- ""
return(x)
}
if( length(dimensions) > 0) {
for(d in dimensions) {
# Change NA to the empty string.
df[d] <- data.frame(lapply(df[d], na2emptyString))
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
}
}
na2zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions, and change NA to 0.
if( length(measures) > 1) {
for(m in measures) {
print(m)
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
df[m] <- data.frame(lapply(df[m], na2zero))
df[m] <- lapply(df[m], function(x) as.numeric(as.character(x)))
#df[m] <- data.frame(lapply(df[m], as.numeric(as.character))) # This is needed to turn measures back to numeric because gsub turns them into strings.
}
}
str(df)
write.csv(df, gsub("Natality", "Natality", file_path), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n Natality \n")
if( length(measures) > 0 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 0 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
The cleaned up data file, Natality.csv, is on conneyc’s account under dataset S17 DV Project 5. The TAs have been added as contributors and can download the file from the link provided below.
The data file containing the census data we are using, Census2015.csv, is on conneyc’s account under dataset S17 DV Project 5. The TAs have been added as contributors and can download the file from the link provided below.
This visualization is a boxplot of state vs births. It is filtered by Race and Births and has the detail of the gender of the births. It is observed that California has the largest number of births for female and males followed by Texas.
This is the same data visualization done in Shiny.
This Treemap displays the count of the average age of mothers for each age in 0.5 bins, by race. It is colored by the count, with the race and average age as text. When the mouse is hovered over a box, it tells you the count number. For Asians, it is seen that most mothers are 30.5 years old, white mothers are 28, African Americas are 26, and Natives are 26.5. Asian mothers have babies the latest while African Americans have them the earliest.
White These Shiny histograms show the same result as the Tableau treemap, with the addition of a vertical average line for the average average-median-age for each race. The average line is close to the highest bar for each race except African American, where the average line is about a year later than the most common average-median-age.
This scatterplot compares the median age and number of births of each state. Each point signifies the median age of the state and number of births of each gender for a specific race within that state. The shape of the points classify the race and the color classifies the state. A trend line was added which shows Births = -682.411*Median_Age + 35676.1, meaning as the median age of the state increases, there appears to be a decrease in the number of births.